DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus
Authors
Abstract
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Owing to its size, openness, and relative quality, Wikipedia has already served as a source of such data, but only on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. It describes the properties of the Wikipedia texts, the corpus creation process, the corpus format, and use cases of interest such as Named Entity Linking training and evaluation.
Similar resources
An Open Distributed Architecture for Reuse and Integration of Heterogeneous NLP Components
The shift from Computational Linguistics to Language Engineering is indicative of new trends in NLP. This paper reviews two NLP engineering problems: reuse and integration, while relating these concerns to the larger context of applied NLP. It presents a software architecture which is geared to support the development of a variety of large-scale NLP applications: Information Retrieval, Corpus P...
DBpediaNYD - A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia
Determining the semantic relatedness (i.e., the strength of a relation) of two resources in DBpedia (or other Linked Data sources) is a problem addressed by quite a few approaches in the recent past. However, there are no large-scale benchmark datasets for comparing such approaches, and it is an open problem to determine which of the approaches work better than others. Furthermore, large-scale...
Directions for Exploiting Asymmetries in Multilingual Wikipedia
Multilingual Wikipedia has been used extensively for a variety of Natural Language Processing (NLP) tasks. Many Wikipedia entries (people, locations, events, etc.) have descriptions in several languages. These descriptions, however, are not identical. On the contrary, descriptions in different languages created for the same Wikipedia entry can vary greatly in terms of description length and inform...
NLP & DBpedia An Upward Knowledge Acquisition Spiral
Recently, the DBpedia community has experienced an immense increase in activity, and we believe that the time has come to explore the connection between DBpedia & Natural Language Processing (NLP) in as yet unprecedented depth. DBpedia has a long-standing tradition of providing useful data as well as a commitment to reliable Semantic Web technologies and living best practices. As the extraction of...
Identifying Global Representative Classes of DBpedia Ontology Through Multilingual Analysis: A Rank Aggregation Approach
Identifying the globally representative parts of the multilingual pivotal ontology is important for integrating local language resources into Linked Data. We present a novel method for identifying global representative classes of the DBpedia ontology based on collective popularity, calculated by aggregating ranking orders from Wikipedia’s local language editions.
Publication date: 2016